Machine Learning Model for the Planetary Albedo

Part one

Import Packages

Load dataset

Regression

Divide the data into two halves (left and right side of the Albedo), train on one side (left) and predict the other.
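The split can be sketched as follows. This is a minimal illustration with a synthetic array standing in for the real albedo map (which is not shown here); the `(row, col)` features are an assumption about how the pixels are fed to the models.

```python
import numpy as np

# Hypothetical sketch: treat the albedo map as a 2D array and split it
# into left and right halves along the column axis.
rng = np.random.default_rng(0)
albedo = rng.random((64, 128))  # synthetic stand-in: rows x columns

mid = albedo.shape[1] // 2
left, right = albedo[:, :mid], albedo[:, mid:]

# Each pixel's (row, col) position becomes the features, its value the target.
rows, cols = np.indices(left.shape)
X_train = np.column_stack([rows.ravel(), cols.ravel()])
y_train = left.ravel()
```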

We start the regression with simpler models, evaluate them, and check whether they are appropriate for the problem.

Starting with the Classic Linear Regression model: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
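A minimal sketch of the fit/predict cycle, using synthetic data in place of the albedo features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the real features and target.
rng = np.random.default_rng(42)
X = rng.random((200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.standard_normal(200)

# Ordinary least squares: fit on one half, predict the other.
linreg = LinearRegression().fit(X, y)
y_pred = linreg.predict(X)
```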

Let's evaluate the coefficient of determination $R^2$ of the prediction
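For a tiny worked example of $R^2$ (the numbers here are illustrative, not from the albedo data):

```python
import numpy as np
from sklearn.metrics import r2_score

# R^2 compares the residual sum of squares against a constant-mean baseline:
# 1.0 is a perfect fit, 0.0 is no better than predicting the mean.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
score = r2_score(y_true, y_pred)  # 1 - 0.10 / 5.0 = 0.98
```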

It's not that bad, but we can improve. Let's try the Ridge Regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html#sklearn.linear_model.RidgeCV

There was almost no improvement. Let's see the Lasso: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV

Elastic Net: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html
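The three regularized variants can be compared side by side. A sketch with synthetic data; the alpha grid passed to `RidgeCV` is an illustrative guess, and the other two estimators pick their regularization strength by cross-validation:

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV

rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.7]) + 0.1 * rng.standard_normal(300)

# Each *CV estimator selects its own regularization strength.
ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
enet = ElasticNetCV(cv=5, random_state=0).fit(X, y)

scores = {name: est.score(X, y) for name, est in
          [("ridge", ridge), ("lasso", lasso), ("enet", enet)]}
```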

The results continue in the same vein. Let's try something different, a random forest regressor: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor
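A minimal sketch, again with synthetic data; a deliberately nonlinear target illustrates why trees can succeed where the linear models above plateau:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 2))
# A nonlinear target, which linear models struggle with but trees capture.
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1]) + 0.05 * rng.standard_normal(500)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
```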

Much better! Based on the $R^2$ metric, we select the RandomForestRegressor to make predictions.

Visually checking

Looks nice! As a reminder: the more purple, the closer to the lower value.

As you can see, the residuals were very low.

Part two

Load dataset

There are many gaps! (null values).

Filling in the gaps

As we do not have many input variables, we will build some from statistical measures over a windowed version of our database. So, to help find relationships between albedo and the chemical composition at the top of the planet, we apply this idea in the time domain.
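The windowed features can be sketched with pandas rolling statistics. This is a hypothetical illustration with a synthetic series standing in for the real data, and the window length and choice of statistics are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: derive extra features from rolling-window statistics
# of a single input series (a stand-in for the real albedo/composition data).
rng = np.random.default_rng(0)
series = pd.Series(rng.random(100), name="albedo")

window = 5
features = pd.DataFrame({
    "mean": series.rolling(window).mean(),
    "std": series.rolling(window).std(),
    "min": series.rolling(window).min(),
    "max": series.rolling(window).max(),
}).dropna()  # the first window-1 rows have no complete window
```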

As the following models take longer to train, we will guess ("kick") some hyperparameter values instead of searching exhaustively.
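Guessing hyperparameters just means setting them by hand at construction time. The particular values below are illustrative, not the ones used in this notebook:

```python
from sklearn.ensemble import RandomForestRegressor

# Hand-picked ("kicked") hyperparameters instead of a grid search;
# these specific values are illustrative guesses.
forest = RandomForestRegressor(
    n_estimators=200,    # more trees: better averaging, longer training
    max_depth=12,        # cap depth to limit overfitting and fit time
    min_samples_leaf=2,  # disallow single-sample leaves
    n_jobs=-1,           # use all CPU cores
    random_state=0,
)
```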

It's good, but let's try a simpler neural network: https://scikit-learn.org/stable/modules/neural_networks_supervised.html#regression
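A minimal sketch of a small multilayer perceptron, with synthetic data standing in for the real features; the single hidden layer of 32 units is an illustrative guess:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((400, 3))
y = X @ np.array([1.0, -0.5, 2.0]) + 0.05 * rng.standard_normal(400)

# A small MLP; one hidden layer is often enough to start with.
mlp = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                   random_state=0).fit(X, y)
```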

We can also try something newer: XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data: https://xgboost.readthedocs.io/en/latest/python/python_api.html

As we can see, Random Forest Regressor had the best performance. We will choose it to fill in the gaps:
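The gap-filling step can be sketched as: train on the rows where the target is known, then predict where it is missing. The DataFrame below is a hypothetical stand-in for the real database:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical sketch: fill NaN gaps in one column using the other columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.random(200), "x2": rng.random(200)})
df["target"] = df["x1"] * 2 + df["x2"]
df.loc[df.sample(frac=0.2, random_state=0).index, "target"] = np.nan  # gaps

known = df[df["target"].notna()]
missing = df[df["target"].isna()]

forest = RandomForestRegressor(random_state=0).fit(
    known[["x1", "x2"]], known["target"])
df.loc[missing.index, "target"] = forest.predict(missing[["x1", "x2"]])
```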

Let's see how the filling turned out.

We would probably need a more complex model to fill these gaps. This article could give a good idea: https://www.hindawi.com/journals/cin/2016/6156513/

Prediction

Now we take advantage of the three best-fitted models to make predictions for the bottom of Mercury.
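A minimal sketch of the prediction step: loop over several already-fitted models and collect their outputs for the unseen region. Synthetic data stands in for the Mercury bottom-half features, and the particular trio of models is an assumption:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, RidgeCV

# Synthetic stand-in for the training data and the region to predict.
rng = np.random.default_rng(0)
X_train = rng.random((300, 2))
y_train = X_train @ np.array([1.0, -2.0]) + 0.05 * rng.standard_normal(300)
X_bottom = rng.random((100, 2))  # features of the unseen region

models = {
    "linear": LinearRegression().fit(X_train, y_train),
    "ridge": RidgeCV().fit(X_train, y_train),
    "forest": RandomForestRegressor(random_state=0).fit(X_train, y_train),
}
predictions = {name: m.predict(X_bottom) for name, m in models.items()}
```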

As can be seen in both models, even the Random Forest Regressor, which had the best fit by $R^2$, did not perform well on the prediction. The output is very linear, as the plots show.

This confirms the hypothesis that more complex (or more appropriate) models should be applied to this problem, from filling the gaps (which came out very linear) to predicting the bottom.